rptest/datalake: test metadata interoperability with 3rd party system #24643
base: dev
Conversation
Force-pushed from 6f1f148 to 6e607dd.
Retry command for Build#60089: please wait until all jobs are finished before running the slash command.
CI test results: test results on build#60089, build#60100, and build#60178.
Force-pushed from 1292bb1 to 6e1e926.
```python
def run_sample_maintenance_task(self, namespace, table) -> None:
    # Metadata query
    # https://iceberg.apache.org/docs/1.6.1/spark-queries/#files
    initial_parquet_files = self.run_query_fetch_one(
        f"SELECT count(*) FROM {namespace}.{table}.files")[0]

    # Want at least 2 files to be able to assert that optimization did something.
    assert initial_parquet_files >= 2, f"Expecting at least 2 files, got {initial_parquet_files}"

    # Spark Procedures provided by Iceberg SQL Extensions
    # https://iceberg.apache.org/docs/1.6.1/spark-procedures/#rewrite_data_files
    self.run_query_fetch_one(
        f"CALL `redpanda-iceberg-catalog`.system.rewrite_data_files(\"{namespace}.{table}\")"
    )

    optimized_parquet_files = self.run_query_fetch_one(
        f"SELECT count(*) FROM {namespace}.{table}.files")[0]
    assert optimized_parquet_files < initial_parquet_files, f"Expecting fewer files after optimize, got {optimized_parquet_files}"
```
IMO these service classes should only hold operations, not validations. To that end, can we split this into some `count_parquet_files()` and `optimize_parquet_files()`? There are many different kinds of maintenance (https://iceberg.apache.org/docs/1.5.1/maintenance/), and it'll be much easier to reuse if we're more descriptive and have smaller, tighter primitives.
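A rough sketch of what that split could look like, reusing the queries from the snippet above (the helper names are only suggestions, and `run_query_fetch_one` is assumed to behave as in the existing code):

```python
def count_parquet_files(self, namespace, table) -> int:
    # Iceberg metadata query listing the table's data files, see
    # https://iceberg.apache.org/docs/1.6.1/spark-queries/#files
    return self.run_query_fetch_one(
        f"SELECT count(*) FROM {namespace}.{table}.files")[0]

def optimize_parquet_files(self, namespace, table) -> None:
    # Spark procedure from the Iceberg SQL extensions that compacts data files, see
    # https://iceberg.apache.org/docs/1.6.1/spark-procedures/#rewrite_data_files
    self.run_query_fetch_one(
        f"CALL `redpanda-iceberg-catalog`.system.rewrite_data_files(\"{namespace}.{table}\")"
    )
```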
@andrwng I see what you mean, but here I wanted to give more freedom to the implementation of these maintenance tasks, i.e. it might just as well be metadata-only maintenance. I documented the expectations in the code doc of `run_sample_maintenance_task(self, namespace, table)`.
`optimize_parquet_files` and `count_parquet_files` would work for the two engines we have now; do you reckon all of them expose the same primitives, so that we could have a single test for all of them?
I agree that `run_sample_maintenance_task()` seems a bit vague, whereas the Iceberg docs Andrew provided have a concrete list of tasks we could enumerate. Maybe these are the primitives we should be working with, and leave the assertions to the user.
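For reference, the Iceberg maintenance page roughly enumerates expiring snapshots, removing old metadata files, deleting orphan files, compacting data files, and rewriting manifests. Hypothetical per-task primitives on an engine service might look like the following sketch (none of these names exist in the PR; they only illustrate what "enumerated tasks" could mean):

```python
# Illustrative only: per-task maintenance primitives mirroring the Iceberg
# maintenance docs, to be implemented by each engine service.
class IcebergMaintenanceMixin:
    def expire_snapshots(self, namespace: str, table: str) -> None:
        raise NotImplementedError

    def remove_orphan_files(self, namespace: str, table: str) -> None:
        raise NotImplementedError

    def rewrite_data_files(self, namespace: str, table: str) -> None:
        raise NotImplementedError

    def rewrite_manifests(self, namespace: str, table: str) -> None:
        raise NotImplementedError
```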
> Maybe these are the primitives we should be working with, and leave the assertions to the user.

Assertions like ...? I don't understand what the exact proposal is.
I believe Willem's referring to the assert statements in `run_sample_maintenance_task()` -- to my earlier point, I feel strongly that they don't belong in this Spark service library. They belong in test bodies, or other test verifiers.

> here I wanted to give more freedom to these maintenance tasks implementation...do you reckon all of them expose the same primitives so that we could have a single test for all of them?

I can see how it's a nice property to have, that most query engines can implement some kind of maintenance and so would be able to slot into the new test. To your point, I don't think every engine will always support the exact set of maintenance operations I posted/suggested. IMO once we come across such an engine, we should decide what to do at that time, rather than preemptively introducing a loosely defined primitive -- it's hard for me to imagine this maintenance function will be used widely by test authors.

If you insist that "arbitrary maintenance" belongs here, then please also at least split out `optimize_parquet_files` and `count_parquet_files` (or your preferred names) into the Spark and Trino services and have `run_sample_maintenance_task()` call them.
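A minimal sketch of that fallback, assuming the two helpers sketched earlier are added to the engine services (names are suggestions, not taken from the PR):

```python
def run_sample_maintenance_task(self, namespace, table) -> None:
    # Delegate to the smaller primitives; the assertions could equally move
    # out of the service class and into the test body.
    initial_parquet_files = self.count_parquet_files(namespace, table)
    assert initial_parquet_files >= 2, \
        f"Expecting at least 2 files, got {initial_parquet_files}"

    self.optimize_parquet_files(namespace, table)

    optimized_parquet_files = self.count_parquet_files(namespace, table)
    assert optimized_parquet_files < initial_parquet_files, \
        f"Expecting fewer files after optimize, got {optimized_parquet_files}"
```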
Will be used for testing interoperability with 3rd party vendors.
Force-pushed from 4b21a17 to 92f74aa.
Retry command for Build#60178: please wait until all jobs are finished before running the slash command.
Got a merge conflict.
```python
    cursor.execute(query)
    yield cursor
finally:
    cursor.close()
```
Was this redundant? It seems like it was handling the cursor closing, and the outer try block was handling the client closing.
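For context, a minimal sketch of the pattern under discussion, assuming `run_query` is a `contextlib.contextmanager` that owns a short-lived cursor while the client connection itself is opened and closed by an outer try block elsewhere (the `self.client.cursor()` call is an assumption, not lifted from the PR):

```python
from contextlib import contextmanager

@contextmanager
def run_query(self, query):
    # The cursor lives only for the duration of this query; closing the
    # client itself is assumed to be handled by an outer try block.
    cursor = self.client.cursor()  # assumed client API
    try:
        cursor.execute(query)
        yield cursor
    finally:
        cursor.close()
```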
```diff
@@ -60,3 +60,16 @@ def max_translated_offset(self, namespace, table, partition) -> int:
         query = f"select max(redpanda.offset) from {namespace}.{self.escape_identifier(table)} where redpanda.partition={partition}"
         with self.run_query(query) as cursor:
             return cursor.fetchone()[0]
+
+    def run_sample_maintenance_task(self, namespace, table) -> None:
```
I'm not sure what `_sample_` in the name refers to.
Backports Required
Release Notes